Doc on handling worker with walltime by guillaumeeb · Pull Request #481 · dask/dask-jobqueue

guillaumeeb · 2021-01-23T14:51:04Z

Finally, a little contribution from me, and a doc fix for a long-standing issue.

Fixes #122.

mivade

Thanks, this looks like great additional documentation! I've pointed out some typos and some suggestions to clarify the language a bit, but otherwise this looks good!

mivade · 2021-01-23T17:34:48Z

+- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.
+- when you really don't know how long your workload will take: all your workers could be killed before reaching the end. In this case, you'll want to use adaptive clusters so that Dask ensures some workers are always up.
+
+If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.


Typo in the exception.

mivade · 2021-01-23T17:35:04Z

+
+If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.
+
+The solution to this problem is to tell Dask up front that the workers have a finit life time:


Typo: finit -> finite. Similarly lifetime is usually spelled as a single word.

mivade · 2021-01-23T17:35:52Z

+
+The solution to this problem is to tell Dask up front that the workers have a finit life time:
+
+- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.


enables -> enable
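
For readers following the diff above, here is a minimal sketch of how a `--lifetime` value could be derived from a job's walltime so that workers retire before the queueing system kills the job. The helper name `lifetime_from_walltime` and the five-minute margin are hypothetical and not part of dask-jobqueue; only the `--lifetime` flag itself comes from the diff.

```python
def lifetime_from_walltime(walltime: str, margin_minutes: int = 5) -> str:
    """Return a lifetime string such as '55m' for an 'HH:MM:SS' walltime.

    Hypothetical helper for illustration only: it subtracts a safety margin
    from the walltime so the worker shuts down before the job is killed.
    """
    hours, minutes, _seconds = (int(part) for part in walltime.split(":"))
    # Work in whole minutes; leftover seconds are dropped (rounds down).
    total_minutes = hours * 60 + minutes
    lifetime = total_minutes - margin_minutes
    if lifetime <= 0:
        raise ValueError("walltime is too short for the requested margin")
    return f"{lifetime}m"


# Arguments that would be forwarded to each worker process.
worker_args = ["--lifetime", lifetime_from_walltime("01:00:00")]
print(worker_args)  # ['--lifetime', '55m']
```

How this list reaches the workers depends on the dask-jobqueue version; the cluster option that forwards extra worker arguments is assumed here, not shown.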

mivade · 2021-01-23T17:36:52Z

+How to handle job queueing system walltime killing workers
+----------------------------------------------------------

+In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.


Should be "every worker process runs..."

mivade · 2021-01-23T17:37:24Z

+In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.
+Reaching walltime can be troublesome in several cases:
+
+- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.


hopping -> hoping and "before you workload" -> "before your workload"

mivade · 2021-01-23T17:39:06Z

+The solution to this problem is to tell Dask up front that the workers have a finit life time:
+
+- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.
+- Use `--lifetime-stagger` when dealing with many workers (say > 20): this will allow to avoid workers all terminating at the same time, and so to ease rebalancing tasks and scheduling burden.


"this will allow to avoid workers all" -> "this will prevent workers from"

"and so to ease" -> "and so ease" or (probably better) "thus"
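
A rough illustration of why staggering helps, under one assumption that this PR does not state: each worker's lifetime is shifted by a random offset of up to the stagger value in either direction. The function `staggered_lifetimes` is hypothetical and only simulates the effect; it does not call Dask.

```python
import random


def staggered_lifetimes(n_workers, lifetime_s, stagger_s, seed=0):
    """Simulate per-worker lifetimes in seconds.

    Assumption: every worker gets lifetime_s plus a uniform random offset
    in [-stagger_s, +stagger_s]. Hypothetical helper, not a Dask API.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same spread
    return [lifetime_s + rng.uniform(-stagger_s, stagger_s) for _ in range(n_workers)]


# 30 workers, 55-minute lifetime, 4-minute stagger.
times = staggered_lifetimes(n_workers=30, lifetime_s=55 * 60, stagger_s=4 * 60)
print(f"earliest: {min(times) / 60:.1f} min, latest: {max(times) / 60:.1f} min")
```

Under that assumption, a 55-minute lifetime with a 4-minute stagger keeps every simulated shutdown between 51 and 59 minutes, which still fits inside a one-hour walltime while spreading the shutdowns out.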

mivade · 2021-01-23T17:39:28Z

+    cluster.adapt(minimum=0, maximum=200)
+
+
+Here is an example of a workflow taking advantage of this, if you wan't to give it a try or adapt it to your use case:


wan't -> want

guillaumeeb · 2021-01-23T20:19:08Z

Many thanks @mivade! I need to practice my English...

andersy005

Thank you for putting this together, @guillaumeeb!

guillaumeeb added 2 commits January 23, 2021 14:47

Doc on handling worker with walltime

a2e4198

Improving inlining

0fb0a56

mivade reviewed Jan 23, 2021

Fix typos

dc41575

andersy005 approved these changes Jan 24, 2021

andersy005 added the documentation Documentation-related label Jan 24, 2021

guillaumeeb merged commit 69f27ac into dask:master Jan 24, 2021

guillaumeeb merged 3 commits into dask:master from guillaumeeb:update_docs_handling_workers
